Lost Connections: Assessing the Impact of Retractions in Academia¶

Introduction¶

The recent buzz in academia involved two cases of alleged academic fraud. One was against Francesca Gino, a behavioral scientist and prominent professor at Harvard University. Ironically, her research was about honesty. The second was against Marc Tessier-Lavigne, a neuroscientist and former President of Stanford University, who stepped down due to the allegations. While the media focused on the reputational damage suffered by these academics, I wondered how many papers were affected by the eventual retractions of their work. A search in Google Scholar shows that Gino's most cited retracted paper received 527 citations; the number is 737 for Tessier-Lavigne. These citations extend to other researchers who referenced their work, and the ripple effect continues as these secondary citers themselves become primary sources for subsequent researchers.

I thought this issue was a good opportunity to put some basic network science concepts into context. Along the way, I will show you how to write code that calculates and visualizes descriptive quantities from network data. In terms of programming knowledge, I assume you know how to run code in a Python integrated development environment (IDE) such as JupyterLab or Google Colab, and how to perform basic programming tasks (control flow statements, creating functions, manipulating data structures, importing libraries) in Python. In terms of math knowledge, I assume basic familiarity with functions and matrices.

Background and Basics¶

Network, Nodes, Edges¶

A network is a system of interconnected entities. This broad definition allows network science to permeate all disciplines and industries. Here are some examples:

  • The social network of Facebook consists of people or their accounts. Two people are considered connected if they are Facebook friends. Here, the connection is mutual.
  • The informational network of the World Wide Web consists of web pages. Two pages are connected if one is accessible from the other through a hyperlink. There is a sense of direction of connection from source to destination.
  • A transportation network consists of landmarks connected by routes. In road networks, the strength of connection can express the number of lanes between two landmarks.
  • A patient-disease network is a biological network that connects patients to their diseases. There are two kinds of entities in the network: patients and diseases. Patient-patient or disease-disease connections do not occur.

The simplest network is a system of two connected entities. Mathematically, this is represented by two nodes connected by an edge:

image.png

A graph refers to the mathematical representation of a network, but you will hear graph and network used interchangeably. A graph can be as simple as this:

image.png

and as complex as this:

image.png

This is a co-occurrence graph of the book "A Game of Thrones" by George R. R. Martin. Characters are connected if their names appear within 13 words of each other in the novel. The thickness of the link represents the number of times they co-occur. Data obtained from [Kaggle](https://www.kaggle.com/datasets/mmmarchetti/game-of-thrones-dataset) and plotted using [Gephi](https://gephi.org/).

Directed graphs, weighted graphs, bipartite graphs¶

The examples above touched upon types of networks. Let us make the distinctions explicit:

  1. Directed graphs - the connections/edges have the information of direction
  2. Weighted graphs - the connections/edges have a numerical value, usually pertaining to the strength of connection
  3. Bipartite graphs - there are two kinds of entities/nodes and a restriction that an entity cannot connect with its own kind
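To make the three distinctions concrete, here is a minimal sketch of each type using the NetworkX library (introduced more fully later in this article). All node labels and the weight value are invented for illustration:

```python
import networkx as nx

# Directed graph: edges carry direction (e.g., citer -> cited)
digraph = nx.DiGraph()
digraph.add_edge('paper_A', 'paper_B')  # paper_A cites paper_B

# Weighted graph: edges carry a numerical strength
weighted = nx.Graph()
weighted.add_edge('Jon', 'Arya', weight=31)  # e.g., 31 co-occurrences

# Bipartite graph: two kinds of nodes; edges only run between the kinds
bip = nx.Graph()
bip.add_nodes_from(['patient_1', 'patient_2'], bipartite=0)
bip.add_nodes_from(['flu', 'asthma'], bipartite=1)
bip.add_edges_from([('patient_1', 'flu'), ('patient_2', 'flu'),
                    ('patient_2', 'asthma')])

print(digraph.is_directed())              # True
print(weighted['Jon']['Arya']['weight'])  # 31
print(nx.is_bipartite(bip))               # True
```

Note that the same graph can combine these properties; a road network, for instance, can be both directed (one-way streets) and weighted (number of lanes).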

image.png

Descriptive quantities¶

How do you describe networks? What can be calculated from a graph that translates to real-world insights? Such quantities will be explored in the context of a citation network: a directed, unweighted, and non-bipartite network. In citation networks, a node represents a scientific paper and a directed edge represents a citation from one paper to another. By calculating these quantities, we gain a better understanding of the effects of retraction on the academic publishing system.

Directed-citation-network-Nodes-represent-papers-in-the-corpus-Directed-edges-represent.png

This is a citation network. Circles represent research papers and arrows represent citations. For example, first author Sung was cited by the sources of the arrows pointing to it, such as Khoury and Wang. Source: [researchgate.net](https://www.researchgate.net/figure/Directed-citation-network-Nodes-represent-papers-in-the-corpus-Directed-edges-represent_fig2_313265540)

Data source¶

The data comes from SNAP Datasets, a collection of large-scale network datasets stored in text format. We will use the High-energy physics theory citation network page, which contains a citation network of theoretical high-energy physics papers from the open-access repository arXiv.

image.png

Screenshot from the High-energy physics theory citation network dataset.

There are three downloadable files in the Files section of the page. We will only use cit-HepTh.txt.gz. To open this compressed file, we use the gzip module from Python's standard library:

In [1]:
import gzip

input_file = r"C:\Users\63926\Desktop\cit-HepTh.txt.gz"
output_file = r"C:\Users\63926\Desktop\cit-HepTh.txt"

with gzip.open(input_file, 'rb') as f_in:
    file_content = f_in.read()

with open(output_file, 'wb') as f_out:
    f_out.write(file_content)

The code saves the contents of the compressed file as cit-HepTh.txt. Make sure you edit the input_file and output_file strings to match the file paths on your device. The text file begins with four comment lines describing the dataset, followed by tab-separated numbers (see screenshot below). The first column gives the ID of the citing paper, while the second column gives the ID of the referenced paper. This data structure is called an edge list.

image.png

Screenshot of the contents of the text file.

Let us put the data in a NumPy array. For first-time users, NumPy is a Python library for manipulating arrays and matrices. You may install it using

In [2]:
!pip install numpy
Requirement already satisfied: numpy in c:\users\63926\anaconda3\lib\site-packages (1.23.5)

We then import the library and load the data using the np.loadtxt function. The argument comments='#' skips the comment lines at the top of the file, leaving only the ID values.

In [3]:
import numpy as np

edge_list = np.loadtxt(gzip.open(input_file, 'rb'), dtype=int, comments='#')
edge_list
Out[3]:
array([[   1001, 9304045],
       [   1001, 9308122],
       [   1001, 9309097],
       ...,
       [9912286, 9808140],
       [9912286, 9810068],
       [9912286, 9901023]])
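If you want to sanity-check this loading pipeline without the full SNAP download, a miniature file in the same format behaves identically (the file name mini_edge_list.txt.gz is made up for this sketch):

```python
import gzip
import numpy as np

# A miniature file in the same format as cit-HepTh.txt: comment lines
# starting with '#', then tab-separated FromNodeId / ToNodeId pairs
sample = "# Directed graph\n# FromNodeId\tToNodeId\n1001\t9304045\n1001\t9308122\n"
with gzip.open("mini_edge_list.txt.gz", "wt") as f:
    f.write(sample)

# comments='#' skips the header lines, leaving only the ID pairs
mini = np.loadtxt(gzip.open("mini_edge_list.txt.gz", "rb"), dtype=int, comments="#")
print(mini.shape)  # (2, 2): two edges, two columns
```

Each row of the resulting array is one directed edge, exactly as in the full edge list above.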

Advanced analytics¶

The file cit-HepTh-dates.txt.gz contains the paper submission dates. This data is useful for more advanced explorations involving time dynamics. The file cit-HepTh-abstracts.tar.gz contains the abstracts of the papers, and could serve as an entry point for natural language processing and topic modeling.

Analysis and Insights¶

We need two additional Python libraries: one to compute descriptive quantities and another to visualize them. For these tasks, we will install and import the NetworkX and Matplotlib libraries:

In [4]:
!pip install networkx
!pip install matplotlib
Requirement already satisfied: networkx in c:\users\63926\anaconda3\lib\site-packages (2.8.4)
Requirement already satisfied: matplotlib in c:\users\63926\anaconda3\lib\site-packages (3.7.0)
In [5]:
import networkx as nx
from matplotlib import pyplot as plt

To transform the edge list into a NetworkX graph object, I used nx.DiGraph to create an empty directed graph. Iterating over the entries of the edge list, I added each entry to the graph with the DiGraph.add_edge method. Despite its name, this method also adds the corresponding nodes to the graph.

In [6]:
G = nx.DiGraph()

for from_node, to_node in edge_list:
    G.add_edge(from_node, to_node)

Size, order, network scale¶

The order and size of a network refer to the number of nodes and edges, respectively. Calling the DiGraph.order and DiGraph.size methods reveals 27,770 research papers and 352,807 citations in the citation network, which puts it in the realm of large-scale networks.

In [7]:
N = G.order()
L = G.size()

print ('Number of nodes: ', N)
print ('Number of edges: ', L)
Number of nodes:  27770
Number of edges:  352807

Visualizing the graph¶

NetworkX is not equipped to handle the visualization of large networks. To see the citation graph, we use the visualization tool Gephi. I will not discuss how to use the program. I recommend this article for a tutorial.

My mid-spec laptop could not properly render the visuals for all the nodes, so I filtered them to include only those papers that received 100 citations or more. The size and hue of the circles denote the number of citations. It is difficult to extract insights from such a convoluted graph. This is why descriptive quantities are important: they give us numerical bases for qualitative characteristics.

Untitled.png

Adjacency matrix, symmetric matrix, self-loops¶

Another way of mathematically representing a network, apart from the edge list, is through an adjacency matrix: a square matrix whose rows and columns represent nodes, such that the matrix element $A_{ij}$ indicates the existence of an edge from node $i$ to node $j$. An element is 0 if there is no edge between the nodes and 1 if there is; in weighted graphs, the element stores the edge weight instead.

For undirected graphs, an edge from node $i$ to node $j$ implies the same edge from node $j$ to node $i$, that is, the connection is mutual. This means $A_{ji}=A_{ij}$, a property of a so-called symmetric matrix. This flipping of $i$ and $j$ can be performed in NumPy using the .T attribute.

Non-zero elements $A_{ii}$ along the diagonal of the matrix imply a connection of an entity to itself; these are called self-loops. The diagonal of a NumPy matrix can be called using the np.diag function.

image.png
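These two checks are easy to see on a toy example. Below is a made-up adjacency matrix for a three-node directed graph with a self-loop:

```python
import numpy as np

# Toy adjacency matrix for a directed graph: 0 -> 1, 1 -> 2,
# plus a self-loop on node 2 (the diagonal element A[2, 2] = 1)
A_toy = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [0, 0, 1]])

is_symmetric = bool(np.all(A_toy == A_toy.T))    # True only for undirected graphs
has_self_loops = bool(np.any(np.diag(A_toy) > 0))

print(is_symmetric)    # False: the toy graph is directed
print(has_self_loops)  # True: node 2 links to itself
```

The same two tests, applied to the full citation network's adjacency matrix, are exactly what the code further below performs.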

We generate the adjacency matrix using the nx.adjacency_matrix function.

In [8]:
A = nx.adjacency_matrix(G).toarray()
print(f'Shape of adjacency matrix (rows, columns): {A.shape}')
print('Preview of adjacency matrix:')
A
C:\Users\63926\AppData\Local\Temp\ipykernel_14908\1915626862.py:1: FutureWarning: adjacency_matrix will return a scipy.sparse array instead of a matrix in Networkx 3.0.
  A = nx.adjacency_matrix(G).toarray()
Shape of adjacency matrix (rows, columns): (27770, 27770)
Preview of adjacency matrix:
Out[8]:
array([[0, 1, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int32)

The number of rows and columns of an adjacency matrix must be equal to the order. The shape computed by the .shape attribute above supports this.

By checking the matrix elements, we identify some graph properties:

  • Check for matrix values greater than 1 to identify if the graph is weighted.
  • Check if the matrix is symmetric to identify if the graph is undirected.
  • Check if the sum of diagonal elements is non-zero to detect the presence of self-loops.

In the code below, I used np.any to check if any elements are greater than 1. I used .T to transpose the matrix and np.all to check if the matrix is equal to its transpose for all elements. Finally, I used np.diag to get the diagonal elements and np.any to check if any diagonal elements are non-zero.

The code confirms that we have a directed, unweighted citation network. However, the code also confirms the uncharacteristic presence of self-loops, which would mean that a paper referenced itself. Applying the .sum() method gives a total of 39 self-loops, rare enough to be disregarded.

In [9]:
# Runtime on mid-spec laptop: less than 1 min
print(f"Weighted? {'Yes' if np.any(A > 1) else 'No'}")
print(f"Directed? {'No' if np.all(A == A.T) else 'Yes'}")
print(f"Self-loops? {'Yes' if np.any(np.diag(A) > 0) else 'No'}")
print(f"Number of self-loops: {(np.diag(A) > 0).sum()}")
Weighted? No
Directed? Yes
Self-loops? Yes
Number of self-loops: 39

Degree, in-degree, out-degree¶

Network size and order tell us that there are 352,807 edges / 27,770 nodes $\approx$ 13 edges per node on average. Is this enough to describe the density of connections? Not quite, because some papers receive far more citations than others. It would be nice to know how many papers have an impressive number of citations and how many receive a mediocre, or even zero, citation count.

The degree of a node is the number of edges connected to it; equivalently (absent self-loops), it is the number of the node's neighbors. For directed graphs, there is a further distinction between in-degree and out-degree, referring to the number of edges pointing towards and away from a node, respectively.

image.png

We use the DiGraph.degree, DiGraph.in_degree, and DiGraph.out_degree attributes to generate the degrees, in-degrees, and out-degrees of all nodes. We use several NumPy functions to calculate the minimum, maximum, and average values. We also use np.histogram and plt.bar to plot the histogram of degree values. Feel free to increase the value of the bins parameter to make distributions smoother.

In [10]:
ks = list(dict(G.degree).values())

k_min = np.min(ks)
k_max = np.max(ks)
k_avg = np.mean(ks)

print(f'Minimum degree: {k_min}')
print(f'Maximum degree: {k_max}')
print(f'Average degree: {k_avg:.3f}')

hist_values, bin_edges = np.histogram(ks, bins=200)
plt.bar(bin_edges[:-1], hist_values, width=np.diff(bin_edges), align='edge')
plt.xlabel('degree')
plt.ylabel('count')
plt.title('Histogram of node degrees');
Minimum degree: 1
Maximum degree: 2468
Average degree: 25.409
In [11]:
ks = list(dict(G.in_degree).values())

k_min = np.min(ks)
k_max = np.max(ks)
k_avg = np.mean(ks)

print(f'Minimum in-degree: {k_min}')
print(f'Maximum in-degree: {k_max}')
print(f'Average in-degree: {k_avg:.3f}')

hist_values, bin_edges = np.histogram(ks, bins=200)
plt.bar(bin_edges[:-1], hist_values, width=np.diff(bin_edges), align='edge')
plt.xlabel('in-degree')
plt.ylabel('count')
plt.title('Histogram of node in-degrees');
Minimum in-degree: 0
Maximum in-degree: 2414
Average in-degree: 12.705
In [12]:
ks = list(dict(G.out_degree).values())

k_min = np.min(ks)
k_max = np.max(ks)
k_avg = np.mean(ks)

print(f'Minimum out-degree: {k_min}')
print(f'Maximum out-degree: {k_max}')
print(f'Average out-degree: {k_avg:.3f}')

hist_values, bin_edges = np.histogram(ks, bins=200)
plt.bar(bin_edges[:-1], hist_values, width=np.diff(bin_edges), align='edge')
plt.xlabel('out-degree')
plt.ylabel('count')
plt.title('Histogram of node out-degrees');
Minimum out-degree: 0
Maximum out-degree: 562
Average out-degree: 12.705

Degree distribution, small world, power law¶

According to complexity science, some networks have small-world properties that lie somewhere between those of regular networks and random networks. A related landmark finding came from a paper by Barabási and Albert reporting the power-law degree distribution of the World Wide Web, which challenged the previous belief that networks tend to have normal (bell-curve) degree distributions. In power-law distributed (also called scale-free) networks, entities with few connections are highly common, while entities with numerous connections are relatively rare.

image.png

A normal (bell curve) distribution and a power law distribution.

The in-degree gives the number of citations of a paper. The in-degree distribution of the citation network follows a power law$^\dagger$ which means very few papers would cause a significant retraction impact on primary citers. This insight ignores the effect of retraction on the secondary citations (i.e. those who cited the primary citers) and other papers across the citation web. As a very artificial but plausible example, what if the retracted paper was cited only once but the citer was cited a thousand times? Did the low in-degree reflect the importance of the node? This calls for a more sophisticated metric that would better quantify the effects of retraction.

$^\dagger$It turns out that showing a distribution follows a power law is not as straightforward as one might expect. The Appendix shows how to use the powerlaw Python module to test the goodness of a power-law fit.

Path length, strongly connected, weakly connected¶

A path is a chain of edges connecting two nodes. For a directed graph, you can only move according to the direction of the arrow. The path length between two nodes in an unweighted graph is the number of edges that connect the nodes.
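A toy citation chain makes this concrete. In the sketch below (the paper labels are invented), the path length is just the number of citation hops, and direction matters:

```python
import networkx as nx

# Toy citation chain: A cites B, B cites C, and D cites C
toy = nx.DiGraph([('A', 'B'), ('B', 'C'), ('D', 'C')])

# Path length = number of edges along the (directed) citation trail
print(nx.shortest_path_length(toy, 'A', 'C'))  # 2, via A -> B -> C

# Following edge directions, C cannot reach A at all
print(nx.has_path(toy, 'C', 'A'))  # False
```

The second result previews the connectivity question below: even in this tiny graph, some ordered pairs of nodes have no path between them.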

This quantity is defined for every pair only if at least one path connects each pair of nodes. A directed network that satisfies this while following edge directions is called strongly connected. The nx.is_strongly_connected function confirms that the citation network is not strongly connected: there are papers that are not connected by a trail of citations. What if we allow crossing a directed edge in reverse? If at least one such path now connects every pair of nodes, the network is weakly connected. The nx.is_weakly_connected function shows that the citation network is not even weakly connected.

What do these results tell us about the citation network? It means there is a limit to the impact of retracted papers. The trace of citation will encounter unreachable nodes; these are papers that will be unaffected by the retraction.

In [13]:
print(f"Strongly connected? {'Yes' if nx.is_strongly_connected(G) else 'No'}")
print(f"Weakly connected? {'Yes' if nx.is_weakly_connected(G) else 'No'}")
Strongly connected? No
Weakly connected? No

Clustering coefficient¶

The clustering coefficient of a node is the proportion of the node's neighbors that are connected to each other. Think about that for a bit. The higher the clustering coefficient, the more the node and its neighbors are interconnected.
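A minimal sketch on a made-up four-node graph shows the three typical cases:

```python
import networkx as nx

# Toy undirected graph: a triangle 0-1-2 plus a dangling node 3
toy = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])

cc = nx.clustering(toy)
print(cc[0])  # 1.0 -- both neighbors of node 0 (namely 1 and 2) are connected
print(cc[2])  # 1/3 -- neighbors {0, 1, 3} form one connected pair (0-1) out of three
print(cc[3])  # 0 -- only one neighbor, so there are no neighbor pairs at all
```

Note that nodes with a single neighbor, like node 3 here, always get a clustering coefficient of zero, which partly explains the spike at zero in the histogram below.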

We use code similar to that of the degree distribution to generate the clustering coefficient histogram below. The graph shows generally low clustering coefficients, with a significant number of nodes having zero clustering coefficient and very few (no longer visible in the histogram) cases of high clustering coefficient. The distribution is closer to a normal distribution than to a power law.

The fact that most values fall on the lower end of the clustering coefficient spectrum means that tightly knit groups are rare. In such groups, when one paper is retracted, the rest of the papers in the group are affected. If a paper with a high clustering coefficient is retracted, its citers, who not only cited the retracted paper but also cited each other, face a greater risk of correction or retraction. Most nodes in the citation network do not have this problem.

In [14]:
cc = list(dict(nx.clustering(G)).values())

cc_min = np.min(cc)
cc_max = np.max(cc)
cc_ave = np.mean(cc)

print(f'Minimum clustering coefficient: {cc_min:.3f}')
print(f'Maximum clustering coefficient: {cc_max:.3f}')
print(f'Average clustering coefficient: {cc_ave:.3f}')

hist_values, bin_edges = np.histogram(cc, bins=100)
plt.bar(bin_edges[:-1], hist_values, width=np.diff(bin_edges), align='edge')
plt.xlabel('clustering coefficient')
plt.ylabel('count')
plt.title('Histogram of clustering coefficients');
Minimum clustering coefficient: 0.000
Maximum clustering coefficient: 1.000
Average clustering coefficient: 0.157

Degree centrality, betweenness centrality, closeness centrality¶

There are several metrics that attempt to quantify the importance of a node. For example, the degree centrality is the node degree normalized by the maximum possible degree, $N-1$. This metric equates importance with the number of neighbors.

Another metric is called the betweenness centrality, which measures the number of shortest paths that pass through a given node (the general formula is more complicated but the simplified interpretation works in our case). This metric equates importance with performing the role of a "bridge" that connects numerous nodes and clusters.

Closeness centrality is the reciprocal of the sum of shortest paths from a node to all other nodes. This metric equates importance with the reachability of the node.
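For intuition, here are all three metrics on a toy graph small enough for NetworkX to handle instantly (the node labels are made up; node 'b' is constructed to be the bridge):

```python
import networkx as nx

# Toy graph: node 'b' bridges two pairs of papers (a1-a2 and c1-c2)
toy = nx.Graph([('a1', 'a2'), ('a2', 'b'), ('b', 'c1'), ('c1', 'c2')])

deg = nx.degree_centrality(toy)       # degree / (N - 1)
btw = nx.betweenness_centrality(toy)  # fraction of shortest paths through a node
clo = nx.closeness_centrality(toy)    # inverse of average distance to all others

# The bridge node 'b' dominates both betweenness and closeness
print(max(btw, key=btw.get))  # b
print(max(clo, key=clo.get))  # b
```

Notice that 'b' wins on betweenness and closeness despite having the same degree as 'a2' and 'c1': the three centralities capture genuinely different notions of importance.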

NetworkX can no longer handle the centrality computations due to our dataset's size (except for the degree centrality, which is more or less the same as the degree). So we turn to NetworKit, a high-performance Python library designed for large networks. I've listed my first impressions below.

image.png

One disadvantage of NetworKit is its lack of integration with NumPy, so we can't reuse the edge list and adjacency matrix we created earlier. Luckily, NetworKit can read the text files from SNAP directly; just pass nk.Format.SNAP to the nk.readGraph function.

In [15]:
!pip install cython
!pip install networkit
Requirement already satisfied: cython in c:\users\63926\anaconda3\lib\site-packages (3.0.0)
Requirement already satisfied: networkit in c:\users\63926\anaconda3\lib\site-packages (10.1)
Requirement already satisfied: numpy in c:\users\63926\anaconda3\lib\site-packages (from networkit) (1.23.5)
Requirement already satisfied: scipy in c:\users\63926\anaconda3\lib\site-packages (from networkit) (1.10.0)
In [16]:
import networkit as nk

H = nk.readGraph(r"C:\Users\63926\Desktop\Cit-HepTh_1.txt", nk.Format.SNAP)

Exact algorithms for computing betweenness and closeness centralities scale at least quadratically with the network order, meaning the runtime grows rapidly with the number of nodes. This is significant for large-scale networks, including our $N$ = 27,770 citation network. An important advantage of NetworKit is that it offers approximation algorithms with roughly $O(L)$ time complexity per sample, where $L$ is the network size, or number of edges. NetworKit runs these algorithms many times and averages the results to obtain more robust values.

Using the nk.centrality.EstimateBetweenness function with 10000 samples, we get the approximate histogram of betweenness centralities shown below. On my mid-spec laptop, the code ran for 5 minutes. You may decrease the number of samples if it takes too long on your device.

In [17]:
B = nk.centrality.EstimateBetweenness(H, 10000)
B.run()
Bs = B.scores()

B_min = np.min(Bs)
B_max = np.max(Bs)
B_ave = np.mean(Bs)

print(f'Minimum betweenness centrality: {B_min:.1f}')
print(f'Maximum betweenness centrality: {B_max:.1f}')
print(f'Average betweenness centrality: {B_ave:.1f}')

hist_values, bin_edges = np.histogram(Bs, bins=100)
plt.bar(bin_edges[:-1], hist_values, width=np.diff(bin_edges), align='edge')
plt.xlabel('betweenness centrality')
plt.ylabel('count')
plt.title('Histogram of betweenness centralities');
Minimum betweenness centrality: 0.0
Maximum betweenness centrality: 63873156.0
Average betweenness centrality: 88402.9

The histogram looks like a very steep power law distribution. The trend becomes more apparent when we remove the values above the 99th percentile:

In [18]:
Bs_99 = np.array(Bs)[Bs < np.percentile(Bs, 99)]

hist_values, bin_edges = np.histogram(Bs_99, bins=100)
plt.bar(bin_edges[:-1], hist_values, width=np.diff(bin_edges), align='edge')
plt.xlabel('betweenness centrality')
plt.ylabel('count')
plt.title('Histogram of betweenness centralities below the 99th percentile');

In the context of the data, the betweenness centrality distribution indicates that a few papers serve as crucial links between different clusters of publications, possibly representing subdisciplines in high energy physics. Retracting papers with high betweenness centrality would significantly impact papers from different subdisciplines.

Using the nk.centrality.ApproxCloseness function with 10000 samples, we get the approximate histogram of closeness centralities shown below. On my mid-spec laptop, the code ran for 2 minutes. The extra lines of code remove undefined (NaN) closeness values, which come from completely disconnected nodes, and outlier values at the top of the range (above the 99th percentile).

In [19]:
C = nk.centrality.ApproxCloseness(H, 10000)
C.run()
Cs = C.scores()
Cs = np.array(Cs)[~np.isnan(Cs)]

Cs = Cs[Cs < np.percentile(Cs, 99)]

C_min = np.min(Cs)
C_max = np.max(Cs)
C_ave = np.mean(Cs)

print(f'Minimum closeness centrality: {C_min:.5f}')
print(f'Maximum closeness centrality: {C_max:.5f}')
print(f'Average closeness centrality: {C_ave:.5f}')

hist_values, bin_edges = np.histogram(Cs, bins=100)
plt.bar(bin_edges[:-1], hist_values, width=np.diff(bin_edges), align='edge')
plt.xlabel('closeness centrality')
plt.ylabel('count')
plt.title('Histogram of closeness centralities');
Minimum closeness centrality: 0.00000
Maximum closeness centrality: 0.00001
Average closeness centrality: 0.00001

Aside from a few outliers, the closeness centralities are very small, indicating a large number of edges separating any two nodes on average. In the context of citation networks, a paper with low closeness centrality would be distantly related to most papers in the network. If the impact of retraction diminishes as we trace along the trail of citations, then low closeness centralities suggest low overall retraction impact.

Eigenvector centrality¶

The eigenvector centrality measures the importance of a node by taking into account the degree of the node and its neighbors. The calculation involves solving for the eigenvector of the adjacency matrix that yields the largest eigenvalue. We would have to spend a long time studying eigenvalue and eigenvector theory to really understand the significance of this eigenvector. We simplify by saying that the computation of the eigenvector involves looking at the degree of the node in question, then the degree of its neighbors, then the degree of its neighbors' neighbors, and so on. This sounds like the perfect metric for our objective.
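A toy hub-and-spoke graph illustrates the idea; the structure below is made up, with a triangle included so that the power iteration used by NetworkX converges cleanly:

```python
import networkx as nx

# Toy hub-and-spoke graph: node 0 is the hub, with a triangle 0-1-2
toy = nx.Graph([(0, 1), (0, 2), (0, 3), (0, 4), (1, 2)])

# Eigenvector centrality: a node scores high if its neighbors score high
ec = nx.eigenvector_centrality(toy, max_iter=500)
print(max(ec, key=ec.get))  # 0 -- the hub, whose neighbors are well connected
```

The hub scores highest not merely because it has the most neighbors, but because its neighbors are themselves connected, which is exactly the recursive notion of importance described above.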

Using the nk.centrality.EigenvectorCentrality function, we get the distribution of eigenvector centralities below. It seems to follow a power law, meaning very few papers have a significant effect on primary citations as well as other papers along the web of citation. Comparing the maximum value ($\approx$0.25) with the maximum possible value which is 1, we can safely say that the impact of retractions on the papers will be moderate at most.

In [20]:
E = nk.centrality.EigenvectorCentrality(H)
E.run()
Es = E.scores()

E_min = np.min(Es)
E_max = np.max(Es)
E_ave = np.mean(Es)

print(f'Minimum eigenvector centrality: {E_min:.5f}')
print(f'Maximum eigenvector centrality: {E_max:.5f}')
print(f'Average eigenvector centrality: {E_ave:.5f}')

hist_values, bin_edges = np.histogram(Es, bins=100)
plt.bar(bin_edges[:-1], hist_values, width=np.diff(bin_edges), align='edge')
plt.xlabel('eigenvector centrality')
plt.ylabel('count')
plt.title('Histogram of eigenvector centralities');
Minimum eigenvector centrality: 0.00000
Maximum eigenvector centrality: 0.25456
Average eigenvector centrality: 0.00246

Insights and Reflection¶

Let us summarize the insights:

The arXiv High-energy physics theory citation network is a large-scale network with a power-law degree distribution. As such, a small number of influential papers act as hubs of scholarly information. The retraction of a hub would cause an immediate wave of scrutiny for a large number of primary citers. However, both the presence of disconnected nodes and the low closeness centrality values indicate limited reachability.

The low clustering coefficient values mean that either the network has a low tendency to form clusters or the connections within clusters are weak. Given the power-law degree distribution, I suspect the presence of weakly connected clusters centered around the hubs. These are clusters that go beyond immediate citations. Identifying them would show the scope of the long-term effects of retraction. Clustering techniques are beyond the scope of this article; to motivate further reading, imagine the benefits arXiv or any journal would gain from a clear picture of which group of papers to track in cases of retraction.

Of all the metrics we tried, eigenvector centrality provides the most complete measure of the effect of a paper's retraction. It tells the same story as the degree distribution: a power-law distribution in which most papers would have a negligible retraction impact while a select few influential papers would have a huge one. Another possible measure of influence is a paper's value as a bridge between subdisciplines, given by the betweenness centrality. Its distribution is yet another power law, in agreement with the other centralities.

I realized while working on this report that it's difficult to assess network metrics in a vacuum. What is considered small or large would depend on the context of the values and comparisons with similar data. I was not able to show how the descriptive quantities compare with, say, other Arxiv categories or other journals. That is something that I could write in a future article and something that you can try out as well.

Conclusion¶

Congratulations on finishing the article. For me, this was more of an exercise in teaching network science concepts rather than an exposition of state-of-the-art techniques. I tried to write it in a way that even beginners would understand, yet still have something for those who are looking for something beyond the basics. I hope you learned something from it.

Appendix¶

A crude approach to check whether a distribution follows a power law is to plot the histogram on a log-log scale: the distribution, or at least its tail, should fit a line. This paper details a rigorous way to assess the goodness of a power-law fit. The gist is that a power-law fit should always be compared against alternative distributions using a statistical test. The powerlaw library implements the procedure detailed in the paper.

Fit a power law on an array of raw values using the powerlaw.Fit class. Then use the powerlaw.Fit.supported_distributions attribute and the powerlaw.Fit.distribution_compare method to compare the goodness of fit of the power law against other decaying functions. The comparison is a likelihood-ratio test that returns an R statistic and a p-value. A positive R means the data are more likely under the power law, while a negative R favors the other distribution. The magnitude of R reflects the size of the discrepancy, and a p-value below 0.05 means the comparison is statistically significant.

In [21]:
import powerlaw

results = powerlaw.Fit(ks, discrete=False)

for i, dist in enumerate(results.supported_distributions.keys()):
    R, p = results.distribution_compare('power_law', dist)
    print(f'{dist}: R = {R}, p = {p}')
Values less than or equal to 0 in data. Throwing out 0 or negative values
Assuming nested distributions
Calculating best minimal value for power law fit
power_law: R = 0.0, p = 1.0
lognormal: R = 0.05524021042689009, p = 0.13614257009638084
exponential: R = 93.95461144048224, p = 3.087058081239643e-06
truncated_power_law: R = 0.00013562365521879727, p = 0.9868597645039232
stretched_exponential: R = 7.257282762115102, p = 0.0014386650390882088
lognormal_positive: R = 7.455612678203015, p = 0.00143938151009422
C:\Users\63926\anaconda3\lib\site-packages\powerlaw.py:1615: RuntimeWarning: invalid value encountered in divide
  CDF = CDF/norm
'nan' in fit cumulative distribution values.
Likely underflow or overflow error: the optimal fit for this distribution gives values that are so extreme that we lack the numerical precision to calculate them.
Assuming nested distributions

The p-value is 1 for the trivial case of comparing the power law to itself, and close to 1 when comparing it to its truncated version. The p-value is far below 0.05 when comparing the power law with the exponential function, with the R statistic heavily favoring the power law fit. Power laws decay more slowly than exponential functions, so we expect a large discrepancy in how well the two fit the data. The remaining functions belong to the same family of heavy-tailed distributions. The p-values are still below 0.05 for the stretched exponential and log-normal positive distributions, with the R statistic favoring the power law fit. Only the comparison with the log-normal yielded an inconclusive result (p > 0.05). These findings establish our confidence in the goodness of fit of the power law.
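The decision rule applied in the paragraph above can be captured in a few lines of plain Python. The `verdict` helper below is a hypothetical convenience function of my own, not part of the powerlaw API, and the R and p values are copied from the cell output above:

```python
def verdict(R, p, alternative, alpha=0.05):
    """Interpret one (R, p) pair from powerlaw.Fit.distribution_compare.

    R > 0 favors the power law, R < 0 favors the alternative;
    the comparison is only conclusive when p < alpha.
    """
    if p >= alpha:
        return f'power_law vs {alternative}: inconclusive (p = {p:.3f})'
    winner = 'power_law' if R > 0 else alternative
    return f'power_law vs {alternative}: favors {winner} (p = {p:.2e})'

# (R, p) values copied from the output of the cell above.
comparisons = [
    ('lognormal', 0.0552, 0.1361),
    ('exponential', 93.9546, 3.09e-06),
    ('stretched_exponential', 7.2573, 1.44e-03),
    ('lognormal_positive', 7.4556, 1.44e-03),
]
for dist, R, p in comparisons:
    print(verdict(R, p, dist))
```

Running this reproduces the reading above: three comparisons conclusively favor the power law, and only the log-normal comparison is inconclusive.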

References¶

  1. Kaiser, J. (2023, July 18). After honesty researcher's retractions, colleagues expand scrutiny of her work. Science Magazine. https://www.science.org/content/article/after-honesty-researcher-s-retractions-colleagues-expand-scrutiny-her-work
  2. Stanford Daily. (2023, July 19). Stanford president resigns over manipulated research, will retract at least 3 papers. https://stanforddaily.com/2023/07/19/stanford-president-resigns-over-manipulated-research-will-retract-at-least-3-papers/
  3. Shu, L. L., Mazar, N., Gino, F., & Bazerman, M. H. (2012). Signing at the beginning makes ethics salient and decreases dishonest self-reports in comparison to signing at the end. Proceedings of the National Academy of Sciences, 109(38), 15197-15200. DOI: 10.1073/pnas.1209746109.
  4. Nikolaev, A., McLaughlin, T., O'Leary, D. D. M., & Tessier-Lavigne, M. (2009). APP binds DR6 to trigger axon pruning and neuron death via distinct caspases. Nature, 457(7232), 981-989. PMID: 19225519. PMCID: PMC2677572. DOI: 10.1038/nature07767.
  5. The Network Science Data Repository. (2003). Cit-HepTh. Retrieved from Stanford Large Network Dataset Collection: https://snap.stanford.edu/data/cit-HepTh.html.
  6. Albert, R., Jeong, H., & Barabási, A.-L. (1999). Diameter of the World-Wide Web. arXiv preprint cond-mat/9907038.
  7. Hagberg, A., Schult, D., & Swart, P. (2021). NetworkX Documentation. Retrieved from https://networkx.org/documentation/stable/.
  8. Staudt, C. L., & Sazonovs, A. (2021). NetworKit Documentation. Retrieved from https://networkit.github.io/.
  9. Clauset, A., Shalizi, C. R., & Newman, M. E. J. (2007). Power-law distributions in empirical data. arXiv preprint arXiv:0706.1062.
  10. powerlaw (Version 1.4.6). (n.d.). Python Package for Power-law Distributions. Retrieved from https://pypi.org/project/powerlaw
  11. Kolmogorov-Smirnov test. (n.d.). In Wikipedia. Retrieved from https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test